
feat(mothership): implement auto-provisioning with manifest (#136)

Merged
deanq merged 77 commits into main from deanq/ae-1660-mothership-deploys-manifest
Jan 14, 2026
Conversation


deanq (Member) commented Jan 9, 2026

Summary

Implement automatic child endpoint provisioning when the Mothership (LoadBalancerSlsResource) boots up. The mothership reads the local manifest, reconciles with State Manager's persisted manifest, deploys/updates/deletes child resources accordingly, sets FLASH_MOTHERSHIP_URL on each child, and serves a /manifest endpoint for service discovery.

Key Features:

  • Background provisioning task (non-blocking) for fast cold starts
  • Intelligent reconciliation: deploy new, update changed, delete removed resources
  • Skips LoadBalancer resources during provisioning (avoids self-deployment)
  • Idempotent provisioning using config hashes to detect changes
  • State Manager integration for persistent manifest state across boots
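The reconciliation and config-hash ideas above can be sketched as follows. This is an illustrative outline, not the actual provisioner code; `config_hash` and `reconcile` are hypothetical names, and the real implementation additionally skips LoadBalancer resources to avoid self-deployment.

```python
import hashlib
import json


def config_hash(config: dict) -> str:
    # Stable hash of a resource config; hash equality means "unchanged".
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()


def reconcile(local: dict, persisted: dict) -> dict:
    # Diff the local manifest against the manifest persisted in State Manager.
    to_deploy = [name for name in local if name not in persisted]
    to_update = [
        name
        for name in local
        if name in persisted
        and config_hash(local[name]) != config_hash(persisted[name])
    ]
    to_delete = [name for name in persisted if name not in local]
    return {"deploy": to_deploy, "update": to_update, "delete": to_delete}
```

Because the diff is computed from config hashes rather than deploy history, re-running it against an unchanged manifest yields empty lists, which is what makes provisioning idempotent across boots.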

What's Included

Core Implementation

  • src/tetra_rp/runtime/mothership_provisioner.py - Main provisioning logic with manifest reconciliation
  • src/tetra_rp/runtime/state_manager_client.py - HTTP client for State Manager API
  • /manifest endpoint in LB handler for service discovery

Deployment Configuration

  • LoadBalancerSlsResource sets FLASH_IS_MOTHERSHIP=true env var
  • Mothership URL constructed from RUNPOD_ENDPOINT_ID
  • Manifest file (flash_manifest.json) loaded during boot

Comprehensive Tests

  • Unit tests for provisioner functions (reconciliation, URL construction, etc.)
  • Integration tests for end-to-end provisioning workflow
  • Resource drift detection tests for manifest reconciliation

Documentation

  • Updated docs/Cross_Endpoint_Routing.md with architecture details
  • Fixed terminology inconsistencies (Directory → Manifest)

Bug Fixes

  • Fixed critical endpoint bug in manifest_client.py (was querying /directory, now queries /manifest)
  • Updated exception references throughout codebase

Testing

  • All tests passing
  • Quality checks: format, lint, type checking all passing

Related Issues

  • AE-1660: Mothership auto-provisioning implementation

deanq added 30 commits January 3, 2026 01:22
Implement a factory function that creates RunPod serverless handlers,
eliminating code duplication across generated handler files.

The generic_handler module provides:
- create_handler(function_registry) factory that accepts a dict of
  function/class objects and returns a RunPod-compatible handler
- Automatic serialization/deserialization using cloudpickle + base64
- Support for both function execution and class instantiation + method calls
- Structured error responses with full tracebacks for debugging
- Load manifest for cross-endpoint function discovery

This design centralizes all handler logic in one place, making it easy to:
- Fix bugs once, benefit all handlers
- Add new features without regenerating projects
- Keep deployment packages small (handler files are ~23 lines each)

Implementation:
- deserialize_arguments(): Base64 + cloudpickle decoding
- serialize_result(): Cloudpickle + base64 encoding
- execute_function(): Handles function vs. class execution
- load_manifest(): Loads flash_manifest.json for service discovery
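The serialization helpers named above follow a simple pickle-then-base64 shape. The sketch below uses the stdlib `pickle` as a dependency-free stand-in for `cloudpickle` (which the module actually uses, since it also handles lambdas and closures); the payload field names are illustrative.

```python
import base64
import pickle  # stand-in for cloudpickle, which also handles lambdas/closures


def serialize_result(result) -> str:
    # Pickle the value, then base64-encode so it survives JSON transport.
    return base64.b64encode(pickle.dumps(result)).decode("utf-8")


def deserialize_arguments(payload: dict) -> tuple:
    # Reverse the encoding for args/kwargs; tolerate missing fields.
    def decode(field, default):
        raw = payload.get(field)
        return pickle.loads(base64.b64decode(raw)) if raw else default

    return decode("args", []), decode("kwargs", {})
```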
…uild process

Implement the build pipeline components that work together to generate
serverless handlers from @Remote decorated functions.

Three core components:

1. RemoteDecoratorScanner (scanner.py)
   - Uses Python AST to discover all @Remote decorated functions
   - Extracts function metadata: name, module, async status, is_class
   - Groups functions by resource_config for handler generation
   - Handles edge cases like decorated classes and async functions

2. ManifestBuilder (manifest.py)
   - Groups functions by their resource_config
   - Creates flash_manifest.json structure for service discovery
   - Maps functions to their modules and handler files
   - Enables cross-endpoint function routing at runtime

3. HandlerGenerator (handler_generator.py)
   - Creates lightweight handler_*.py files for each resource config
   - Each handler imports functions and registers them in FUNCTION_REGISTRY
   - Handler delegates to create_handler() factory from generic_handler
   - Generated handlers are ~23 lines (vs ~98 with duplication)

Build Pipeline Flow:
1. Scanner discovers @Remote functions
2. ManifestBuilder groups them by resource_config
3. HandlerGenerator creates handler_*.py for each group
4. All files + manifest bundled into archive.tar.gz

This eliminates ~95% duplication across handlers by using the factory pattern
instead of template-based generation.
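The grouping step in that pipeline boils down to a dictionary keyed by resource config, with one generated handler file per group. The function and field names below illustrate the shape only, not the actual ManifestBuilder/HandlerGenerator APIs.

```python
from collections import defaultdict


def group_by_resource(functions: list) -> dict:
    # One handler file is generated per resource_config group.
    groups = defaultdict(list)
    for fn in functions:
        groups[fn["resource_config"]].append(fn)
    return dict(groups)


def handler_filename(resource_name: str) -> str:
    # Matches the handler_<resource_name>.py naming convention described above.
    return f"handler_{resource_name}.py"
```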
Implement 19 unit tests covering all major paths through the generic_handler
factory and its helper functions.

Test Coverage:

Serialization/Deserialization (7 tests):
- serialize_result() with simple values, dicts, lists
- deserialize_arguments() with empty, args-only, kwargs-only, mixed inputs
- Round-trip encoding/decoding of cloudpickle + base64

Function Execution (4 tests):
- Simple function execution with positional and keyword arguments
- Keyword argument handling
- Class instantiation and method calls
- Argument passing to instance methods

Handler Factory (8 tests):
- create_handler() returns callable RunPod handler
- Handler with simple function registry
- Missing function error handling (returns error response, not exception)
- Function exceptions caught with traceback included
- Multiple functions in single registry
- Complex Python objects (classes, lambdas, closures)
- Empty registry edge case
- Default execution_type parameter
- None return values
- Correct RunPod response format (success, result/error, traceback)

Test Strategy:
- Arrange-Act-Assert pattern for clarity
- Isolated unit tests (no external dependencies)
- Tests verify behavior, not implementation
- Error cases tested for proper error handling
- All serialization tested for round-trip correctness

All tests passing, 83% coverage on generic_handler.py
…canning

Implement integration tests validating the build pipeline components work
correctly together.

Test Coverage:

HandlerGenerator Tests:
- Handler files created with correct names (handler_<resource_name>.py)
- Generated files import required functions from workers
- FUNCTION_REGISTRY properly formatted
- create_handler() imported from generic_handler
- Handler creation via factory
- RunPod start call present and correct
- Multiple handlers generated for multiple resource configs

ManifestBuilder Tests:
- Manifest structure with correct version and metadata
- Resources grouped by resource_config
- Handler file paths correct
- Function metadata preserved (name, module, is_async, is_class)
- Function registry mapping complete

Scanner Tests:
- @Remote decorated functions discovered via AST
- Function metadata extracted correctly
- Module paths resolved properly
- Async functions detected
- Class methods detected
- Edge cases handled (multiple decorators, nested classes)

Test Strategy:
- Integration tests verify components work together
- Tests verify generated files are syntactically correct
- Tests validate data structures match expected schemas
- No external dependencies in build process

Validates that the entire build pipeline:
1. Discovers functions correctly
2. Groups them appropriately
3. Generates valid Python handler files
4. Creates correct manifest structure
Add comprehensive architecture documentation explaining why the factory
pattern was chosen and how it works.

Documentation includes:

Overview & Context:
- Problem statement: Handler files had 95% duplication
- Design decision: Use factory function instead of templates
- Benefits: Single source of truth, easier maintenance, consistency

Architecture Diagrams (MermaidJS):
- High-level flow: @Remote functions → Scanner → Manifest → Handlers → Factory
- Component relationships: HandlerGenerator, GeneratedHandler, generic_handler
- Function registry pattern: Discovery → Grouping → Registration → Factory

Implementation Details:
- create_handler(function_registry) signature and behavior
- deserialize_arguments(): Base64 + cloudpickle decoding
- serialize_result(): Cloudpickle + base64 encoding
- execute_function(): Function vs. class execution
- load_manifest(): Service discovery via flash_manifest.json

Design Decisions (with rationale):
- Factory Pattern over Inheritance: Simpler, less coupling, easier to test
- CloudPickle + Base64: Handles arbitrary objects, safe JSON transmission
- Manifest in Generic Handler: Runtime service discovery requirement
- Structured Error Responses: Debugging aid, functional error handling
- Both Execution Types: Supports stateful classes and pure functions

Usage Examples:
- Simple function handler
- Class execution with methods
- Multiple functions in one handler

Build Process Integration:
- 4-phase pipeline: Scanner → Grouping → Generation → Packaging
- Manifest structure and contents
- Generated handler structure (~23 lines)

Testing Strategy:
- 19 unit tests covering all major paths
- 7 integration tests verifying handler generation
- Manual testing with example applications

Performance:
- Zero runtime penalty (factory called once at startup)
- No additional indirection in request path
Document the flash build command and update CLI README to include it.

New Documentation:

flash-build.md includes:

Usage & Options:
- Command syntax: flash build [OPTIONS]
- --no-deps: Skip transitive dependencies (faster, smaller archives)
- --keep-build: Keep build directory for inspection/debugging
- --output, -o: Custom archive name (default: archive.tar.gz)

What It Does (5-step process):
1. Discovery: Scan for @Remote decorated functions
2. Grouping: Group functions by resource_config
3. Handler Generation: Create lightweight handler files
4. Manifest Creation: Generate flash_manifest.json
5. Packaging: Create archive.tar.gz for deployment

Build Artifacts:
- .flash/archive.tar.gz: Deployment package (ready for RunPod)
- .flash/flash_manifest.json: Service discovery configuration
- .flash/.build/: Temporary build directory

Handler Generation:
- Explains factory pattern and minimal handler files
- Links to Runtime_Generic_Handler.md for details

Dependency Management:
- Default behavior: Install all dependencies including transitive
- --no-deps: Only direct dependencies (when base image has transitive)
- Trade-offs explained

Cross-Endpoint Function Calls:
- Example showing GPU and CPU endpoints
- Manifest enables routing automatically

Output & Troubleshooting:
- Sample build output with progress indicators
- Common failure scenarios and solutions
- How to debug with --keep-build

Next Steps:
- Test locally with flash run
- Deploy to RunPod
- Monitor with flash undeploy list

Updated CLI README.md:
- Added flash build to command list in sequence
- Links to full flash-build.md documentation
Add a new section explaining how the build system works and why the
factory pattern reduces code duplication.

New Section: Build Process and Handler Generation

Explains:

How Flash Builds Your Application (5-step pipeline):
1. Discovery: Scans code for @Remote decorated functions
2. Grouping: Groups functions by resource_config
3. Handler Generation: Creates lightweight handler files
4. Manifest Creation: Generates flash_manifest.json for service discovery
5. Packaging: Bundles everything into archive.tar.gz

Handler Architecture (with code example):
- Shows generated handler using factory pattern
- Single source of truth: All handler logic in one place
- Easier maintenance: Bug fixes don't require rebuilding projects

Cross-Endpoint Function Calls:
- Example of GPU and CPU endpoints calling each other
- Manifest and runtime wrapper handle service discovery

Build Artifacts:
- .flash/.build/: Temporary build directory
- .flash/archive.tar.gz: Deployment package
- .flash/flash_manifest.json: Service configuration

Links to detailed documentation:
- docs/Runtime_Generic_Handler.md for architecture details
- src/tetra_rp/cli/docs/flash-build.md for CLI reference

This section bridges the main README and the detailed documentation,
providing an entry point for new users discovering the build system.
Wire up the handler generator, manifest builder, and scanner into the
actual flash build command implementation.

Changes to build.py:

1. Integration:
   - Import RemoteDecoratorScanner for function discovery
   - Import ManifestBuilder for manifest creation
   - Import HandlerGenerator for handler file creation
   - Call these in sequence during the build process

2. Build Pipeline:
   - After copying project files, scan for @Remote functions
   - Build manifest from discovered functions
   - Generate handler files for each resource config
   - Write manifest to build directory
   - Progress indicators show what's being generated

3. Fixes:
   - Change .tetra directory references to .flash
   - Uncomment actual build logic (was showing "Coming Soon" message)
   - Fix progress messages to show actual file counts

4. Error Handling:
   - Try/except around handler generation
   - Warning shown if generation fails but build continues
   - User can debug with --keep-build flag

Build Flow Now:
1. Load ignore patterns
2. Collect project files
3. Create build directory
4. Copy files to build directory
5. [NEW] Scan for @Remote functions
6. [NEW] Build and write manifest
7. [NEW] Generate handler files
8. Install dependencies
9. Create archive
10. Clean up build directory (unless --keep-build)

Dependencies:
- Updated uv.lock with all required dependencies
…handling

**Critical Fixes:**
- Remove "Coming Soon" message blocking build command execution
- Fix build directory to use .flash/.build/ directly (no app_name subdirectory)
- Fix tarball to extract with flat structure using arcname="."
- Fix cleanup to remove correct build directory

**Error Handling & Validation:**
- Add specific exception handling (ImportError, SyntaxError, ValueError)
- Add import validation to generated handlers
- Add duplicate function name detection across resources
- Add proper error logging throughout build process

**Resource Type Tracking:**
- Add resource_type field to RemoteFunctionMetadata
- Track actual resource types (LiveServerless, CpuLiveServerless)
- Use actual types in manifest instead of hardcoding

**Robustness Improvements:**
- Add handler import validation post-generation
- Add manifest path fallback search (cwd, module dir, legacy location)
- Add resource name sanitization for safe filenames
- Add specific exception logging in scanner (UnicodeDecodeError, SyntaxError)

**User Experience:**
- Add troubleshooting section to README
- Update manifest path documentation in docs
- Change "Zero Runtime Penalty" to "Minimal Runtime Overhead"
- Mark future enhancements as "Not Yet Implemented"
- Improve build success message with next steps

Fixes all 20 issues identified in code review (issues #1-13, #19-22)
Implement LoadBalancerSlsResource class for provisioning RunPod load-balanced
serverless endpoints. Load-balanced endpoints expose HTTP servers directly to
clients without queue-based processing, enabling REST APIs, webhooks, and
real-time communication patterns.

Key features:
- Type enforcement (always LB, never QB)
- Scaler validation (REQUEST_COUNT required, not QUEUE_DELAY)
- Health check polling via /ping endpoint (200/204 = healthy)
- Post-deployment verification with configurable retries
- Async and sync health check methods
- Comprehensive unit tests
- Full documentation with architecture diagrams and examples

Architecture:
- Extends ServerlessResource with LB-specific behavior
- Validates configuration before deployment
- Polls /ping endpoint until healthy (10 retries × 5s = 50s timeout)
- Raises TimeoutError if endpoint fails to become healthy

This forms the foundation for Mothership architecture where a load-balanced
endpoint serves as a directory server for child endpoints.
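The health-check loop described above (10 retries × 5s = 50s) can be sketched as below. `ping` is injected as any callable returning an HTTP status code, so this is a shape sketch rather than the actual resource method, which polls the endpoint's /ping URL.

```python
import time


def wait_until_healthy(ping, retries: int = 10, delay: float = 5.0) -> None:
    # Poll until /ping reports healthy (200/204); give up after retries * delay seconds.
    for _ in range(retries):
        if ping() in (200, 204):
            return
        time.sleep(delay)
    raise TimeoutError("endpoint failed to become healthy")
```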
Import ServerlessResource directly and use patch.object on the imported class
instead of string-based patches. This ensures the mocks properly intercept the
parent class's _do_deploy method when called via super(). Simplifies mock
configuration and removes an unused variable assertion.

Fixes the three failing deployment tests that were making real GraphQL API calls.
All tests now pass: 418 passed, 1 skipped.
…oints

Implement core infrastructure for enabling @Remote decorator on
LoadBalancerSlsResource endpoints with HTTP method/path routing.

Changes:
- Create LoadBalancerSlsStub: HTTP-based stub for direct endpoint execution
  (src/tetra_rp/stubs/load_balancer_sls.py, 170 lines)
  - Serializes functions and arguments using cloudpickle + base64
  - Direct HTTP POST to /execute endpoint (no queue polling)
  - Proper error handling and deserialization

- Register stub with singledispatch (src/tetra_rp/stubs/registry.py)
  - Enables @Remote to dispatch to LoadBalancerSlsStub for LB resources

- Extend @Remote decorator with HTTP routing parameters (src/tetra_rp/client.py)
  - Add 'method' parameter: GET, POST, PUT, DELETE, PATCH
  - Add 'path' parameter: /api/endpoint routes
  - Validate method/path required for LoadBalancerSlsResource
  - Store routing metadata on decorated functions/classes
  - Warn if routing params used with non-LB resources

Foundation for Phase 2 (Build system integration) and Phase 3 (Local dev).
Update RemoteDecoratorScanner to extract HTTP method and path from
@Remote decorator for LoadBalancerSlsResource endpoints.

Changes:
- Add http_method and http_path fields to RemoteFunctionMetadata
- Add _extract_http_routing() method to parse decorator keywords
- Extract method (GET, POST, PUT, DELETE, PATCH) from decorator
- Extract path (/api/process) from decorator
- Store routing metadata for manifest generation

Foundation for Phase 2.2 (Manifest updates) and Phase 2.3 (Handler generation).
Enhance ManifestBuilder to support HTTP method/path routing for
LoadBalancerSlsResource endpoints.

Changes:
- Add http_method and http_path fields to ManifestFunction
- Validate LB endpoints have both method and path
- Detect and prevent route conflicts (same method + path)
- Prevent use of reserved paths (/execute, /ping)
- Add 'routes' section to manifest for LB endpoints
- Conditional inclusion of routing fields (only for LB)

Manifest structure for LB endpoints now includes:
{
  "resources": {
    "api_service": {
      "resource_type": "LoadBalancerSlsResource",
      "functions": [
        {
          "name": "process_data",
          "http_method": "POST",
          "http_path": "/api/process"
        }
      ]
    }
  },
  "routes": {
    "api_service": {
      "POST /api/process": "process_data"
    }
  }
}
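The validation rules above (reserved paths, duplicate method+path pairs) reduce to a small check performed while building the routes map shown in the manifest. Names below are illustrative, not the actual ManifestBuilder internals.

```python
RESERVED_PATHS = {"/execute", "/ping"}


def build_routes(functions: list) -> dict:
    # Produce the "METHOD /path" -> function-name map, rejecting invalid routes.
    routes = {}
    for fn in functions:
        if fn["http_path"] in RESERVED_PATHS:
            raise ValueError(f"{fn['http_path']} is a reserved path")
        key = f"{fn['http_method']} {fn['http_path']}"
        if key in routes:
            raise ValueError(f"route conflict: {key}")
        routes[key] = fn["name"]
    return routes
```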
Implement LBHandlerGenerator to create FastAPI applications for
LoadBalancerSlsResource endpoints with HTTP method/path routing.

Key features:
- Generates FastAPI apps with explicit route registry
- Creates (method, path) -> function mappings from manifest
- Validates route conflicts and reserved paths
- Imports user functions and creates dynamic routes
- Includes required /ping health check endpoint
- Validates generated handler Python syntax via import

Generated handler structure enables:
- Direct HTTP routing to user functions via FastAPI
- Framework /execute endpoint for @Remote stub execution
- Local development with uvicorn
Create create_lb_handler() factory function that dynamically builds FastAPI
applications from route registries for LoadBalancerSlsResource endpoints.

Key features:
- Accepts route_registry: Dict[(method, path)] -> handler_function mapping
- Registers all user-defined routes from registry to FastAPI app
- Provides /execute endpoint for @Remote stub function execution
- Handles async function execution automatically
- Serializes results with cloudpickle + base64 encoding
- Comprehensive error handling with detailed logging

The /execute endpoint enables:
- Remote function code execution via @Remote decorator
- Automatic argument deserialization from cloudpickle/base64
- Result serialization for transmission back to client
- Support for both sync and async functions
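The factory's core idea, a `(method, path) -> function` registry, can be shown without FastAPI. The real `create_lb_handler()` registers these routes on a FastAPI app; the dependency-free sketch below (with the hypothetical name `create_dispatcher`) only demonstrates the lookup-and-call shape, including the automatic handling of async handlers.

```python
import asyncio
import inspect


def create_dispatcher(route_registry: dict):
    # route_registry maps (METHOD, path) tuples to handler callables.
    def dispatch(method: str, path: str, payload: dict) -> dict:
        handler = route_registry.get((method.upper(), path))
        if handler is None:
            return {"status": 404, "error": f"no route for {method} {path}"}
        result = handler(**payload)
        if inspect.iscoroutine(result):
            # Async handlers are awaited transparently.
            result = asyncio.run(result)
        return {"status": 200, "result": result}

    return dispatch
```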
Update build command to use appropriate handler generators based on
resource type. Separates LoadBalancerSlsResource endpoints (using FastAPI)
from queue-based endpoints (using generic handler).

Changes:
- Import LBHandlerGenerator alongside HandlerGenerator
- Inspect manifest resources and separate by type
- Generate LB handlers via LBHandlerGenerator
- Generate QB handlers via HandlerGenerator
- Combine all generated handler paths for summary

Enables users to mix LB and QB endpoints in same project with correct
code generation for each resource type.
Implement LiveLoadBalancer resource following the LiveServerless pattern
for local development and testing of load-balanced endpoints.

Changes:
- Add TETRA_LB_IMAGE constant for load-balanced Tetra image
- Create LiveLoadBalancer class extending LoadBalancerSlsResource
- Uses LiveServerlessMixin to lock imageName to Tetra LB image
- Register LiveLoadBalancer with LoadBalancerSlsStub in singledispatch
- Export LiveLoadBalancer from core.resources and top-level __init__

This enables users to test LB-based functions locally before deploying,
using the same pattern as LiveServerless for queue-based endpoints.

Users can now write:
  from tetra_rp import LiveLoadBalancer, remote

  api = LiveLoadBalancer(name="test-api")

  @remote(api, method="POST", path="/api/process")
  async def process_data(x, y):
      return {"result": x + y}

  result = await process_data(5, 3)  # Local execution
Implement unit tests for LoadBalancerSlsStub covering:
- Request preparation with arguments and dependencies
- Response handling for success and error cases
- Error handling for invalid responses
- Base64 encoding/decoding of serialized data
- Endpoint URL validation
- Timeout and HTTP error handling

Test coverage:
- _prepare_request: 4 tests
- _handle_response: 5 tests
- _execute_function: 3 error case tests
- __call__: 2 integration tests

Tests verify proper function serialization, argument handling,
error propagation, and response deserialization.
Fix test_load_balancer_vs_queue_based_endpoints by updating the @Remote
decorator to use method='POST' and path='/api/echo' to match the test
assertions. This was a test-level bug where the decorator definition
didn't match what was being asserted.
…ndpoints

- Using_Remote_With_LoadBalancer.md: User guide for HTTP routing, local development, building and deploying
- LoadBalancer_Runtime_Architecture.md: Technical details on deployment, request flows, security, and performance
- Updated README.md with LoadBalancer section and code example
- Updated Load_Balancer_Endpoints.md with cross-references to new guides
Split @Remote execution behavior between local and deployed:
- LiveLoadBalancer (local): Uses /execute endpoint for function serialization
- LoadBalancerSlsResource (deployed): Uses user-defined routes with HTTP param mapping

Changes:
1. LoadBalancerSlsStub routing detection:
   - _should_use_execute_endpoint() determines execution path
   - _execute_via_user_route() maps args to JSON and POSTs to user routes
   - Auto-detects resource type and routing metadata

2. Conditional /execute registration:
   - create_lb_handler() now accepts include_execute parameter
   - Generated handlers default to include_execute=False (security)
   - LiveLoadBalancer can enable /execute if needed

3. Updated handler generator:
   - Added clarity comments on /execute exclusion for deployed endpoints

4. Comprehensive test coverage:
   - 8 new tests for routing detection and execution paths
   - All 31 tests passing (22 unit + 9 integration)

5. Documentation updates:
   - Using_Remote_With_LoadBalancer.md: clarified /execute scope
   - Added 'Local vs Deployed Execution' section explaining differences
   - LoadBalancer_Runtime_Architecture.md: updated execution model
   - Added troubleshooting for deployed endpoint scenarios

Security improvement:
- Deployed endpoints only expose user-defined routes
- /execute endpoint removed from production (prevents arbitrary code execution)
- Lower attack surface for deployed endpoints
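The routing split above hinges on detecting which resource flavor is in play. A minimal stand-in for that detection (the real `_should_use_execute_endpoint()` presumably inspects more metadata than just the class name) could look like:

```python
def should_use_execute_endpoint(resource) -> bool:
    # Local "Live" LB resources ship code per call via /execute;
    # deployed LoadBalancerSlsResource exposes only user-defined routes.
    return type(resource).__name__ in {"LiveLoadBalancer", "CpuLiveLoadBalancer"}
```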
…lude /execute endpoint

- Modified manifest.py to validate LiveLoadBalancer endpoints like LoadBalancerSlsResource
- Updated lb_handler_generator to:
  - Include LiveLoadBalancer in handler generation filter
  - Pass include_execute=True for LiveLoadBalancer (local dev)
  - Pass include_execute=False for LoadBalancerSlsResource (deployed)
- Added integration tests:
  - Verify LiveLoadBalancer handlers include /execute endpoint
  - Verify deployed handlers exclude /execute endpoint
- Fixes critical bug: LiveLoadBalancer now gets /execute endpoint in generated handlers
…ss resources

- Updated scanner to extract LiveLoadBalancer and LoadBalancerSlsResource resources
- Previously only looked for 'Serverless' in class name, missing LoadBalancer endpoints
- Now checks for both 'Serverless' and 'LoadBalancer' in resource type names
- Added integration test to verify scanner discovers both resource types
- Fixes critical bug that prevented flash build from finding LoadBalancer endpoints
- Wrap long lines in manifest.py, lb_handler.py, and load_balancer_sls.py
- Remove unused httpx import in test_load_balancer_sls_stub.py
- Apply consistent formatting across codebase
- Scanner: Use exact type name matching instead of substring matching
  - Whitelist specific resource types to avoid false positives
  - Prevents matching classes like 'MyServerlessHelper' or 'LoadBalancerUtils'

- Type hints: Use Optional[str] for nullable fields in manifest
  - ManifestFunction.http_method and http_path now properly typed

- Timeout: Make HTTP client timeout configurable
  - Added LoadBalancerSlsStub.DEFAULT_TIMEOUT class attribute
  - Added timeout parameter to __init__
  - Updated both _execute_function and _execute_via_user_route to use self.timeout

- Deprecated datetime: Replace datetime.utcnow() with datetime.now(timezone.utc)
  - Updated manifest.py and test_lb_remote_execution.py
  - Ensures Python 3.12+ compatibility
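The datetime change is a one-liner but worth spelling out, since the two calls differ in more than deprecation status:

```python
from datetime import datetime, timezone

# datetime.utcnow() is deprecated as of Python 3.12 and returns a *naive* datetime;
# datetime.now(timezone.utc) returns a timezone-aware one.
stamp = datetime.now(timezone.utc)
iso = stamp.isoformat()  # carries the "+00:00" offset explicitly
```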
The set_serverless_template model_validator was being overwritten by sync_input_fields
(both had mode="after"). In Pydantic v2, when two validators with the same mode are
defined in a class, only one is registered.

This caused templates to never be created from imageName, resulting in:
  "GraphQL errors: One of templateId, template is required to create an endpoint"

Solution:
- Move set_serverless_template validator from ServerlessResource base class to subclasses
  (ServerlessEndpoint and LoadBalancerSlsResource) where the validation is actually needed
- Keep helper methods (_create_new_template, _configure_existing_template) in base class
  for reuse
- Add comprehensive tests for LiveLoadBalancer template serialization

This allows:
1. Base ServerlessResource to be instantiated freely for testing/configuration
2. Subclasses (ServerlessEndpoint, LoadBalancerSlsResource) to enforce template
   requirements during deployment
3. Proper template serialization in GraphQL payload for RunPod API

Fixes: One of templateId, template is required to create an endpoint error when
deploying LiveLoadBalancer with custom image tags like runpod/tetra-rp-lb:local
- Fix: Use correct endpoint URL format for load-balanced endpoints
  (https://{id}.api.runpod.ai instead of https://api.runpod.ai/v2/{id})
  This fixes 404 errors on /ping health check endpoints

- Feature: Add CPU LoadBalancer support
  * Create CpuLoadBalancerSlsResource for CPU-based load-balanced endpoints
  * Create CpuLiveLoadBalancer for local CPU LB development
  * Add TETRA_CPU_LB_IMAGE constant for CPU LB Docker image
  * Update example code to use CpuLiveLoadBalancer for CPU worker
  * Add 8 comprehensive tests for CPU LoadBalancer functionality

- Tests: Add 2 tests for endpoint URL format validation
- All 474 tests passing, 64% code coverage
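The URL fix above is just a different template per endpoint flavor; both constructions are shown side by side for contrast (helper names are illustrative):

```python
def lb_endpoint_url(endpoint_id: str) -> str:
    # Load-balanced endpoints: subdomain-style host, serves /ping directly.
    return f"https://{endpoint_id}.api.runpod.ai"


def qb_endpoint_url(endpoint_id: str) -> str:
    # Queue-based endpoints: versioned path on the shared API host.
    return f"https://api.runpod.ai/v2/{endpoint_id}"
```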
…etra_rp package

LoadBalancer resources were not being discovered by ResourceDiscovery because
the new CPU variants (CpuLiveLoadBalancer, CpuLoadBalancerSlsResource) were
not exported from the main tetra_rp package. This prevented undeploy from
picking up these resources.

Added exports to:
- TYPE_CHECKING imports for type hints
- __getattr__ function for lazy loading
- __all__ list for public API

This fixes the issue where 'flash undeploy list' could not find LoadBalancer
resources that were deployed with 'flash run --auto-provision'.
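The lazy-loading export pattern mentioned here (a module-level `__getattr__`, PEP 562) can be demonstrated on a synthetic module. The mapping target below is a deliberate stand-in, not the real tetra_rp submodule path:

```python
import importlib
import types

# Map public names to "module.attr" targets, resolved on first access (PEP 562).
_LAZY_EXPORTS = {"CpuLiveLoadBalancer": "collections.OrderedDict"}  # stand-in target

mod = types.ModuleType("tetra_rp_sketch")


def _module_getattr(name: str):
    try:
        module_path, attr = _LAZY_EXPORTS[name].rsplit(".", 1)
    except KeyError:
        raise AttributeError(f"module has no attribute {name!r}") from None
    return getattr(importlib.import_module(module_path), attr)


mod.__getattr__ = _module_getattr  # module-level __getattr__ hook
mod.__all__ = list(_LAZY_EXPORTS)
```

In the real package, the `__getattr__` lives in `__init__.py` and imports from the resources submodule, so importing `tetra_rp` stays cheap while `__all__` still advertises the full public API to discovery tools.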
deanq added 2 commits January 9, 2026 02:05
Update documentation to consistently use 'Manifest' instead of 'Directory':
- Replace DirectoryClient references with StateManagerClient (actual implementation)
- Update architecture diagram to reference /manifest endpoint instead of DirectoryClient
- Fix ServiceRegistry code examples to use /manifest endpoint
- Update extension point for custom directory backends
- Fix testing section to reference actual test files (MothershipProvisioner, StateManagerClient)
- Update debugging section with /manifest endpoint examples
- Clarify that directory is loaded from mothership /manifest endpoint

These changes ensure documentation matches the actual AE-1660 implementation.
Critical fix: Update ManifestClient to query /manifest endpoint instead of /directory

Changes:
- Fix ManifestClient.get_directory() to query /manifest endpoint (not /directory)
- Update ManifestClient docstring: 'manifest directory service' → '/manifest endpoint'
- Fix DirectoryUnavailableError → ManifestServiceUnavailableError in docs
- Update example URLs from 'api.runpod.io' to actual LB endpoint format
- Clarify in docstrings that this queries the mothership's /manifest endpoint

This bug would have caused runtime failures when querying the mothership directory,
as the actual endpoint served by lb_handler_generator.py is /manifest, not /directory.
deanq changed the base branch from main to deanq/ae-1196-absolute-drift-detection January 9, 2026 10:18
Base automatically changed from deanq/ae-1196-absolute-drift-detection to main January 12, 2026 04:12
deanq requested a review from Copilot January 12, 2026 10:26

Copilot AI left a comment


Pull request overview

This PR implements automatic child endpoint provisioning for the Mothership (LoadBalancerSlsResource) using manifest reconciliation. The mothership reads the local manifest file on boot, reconciles it with a persisted manifest from State Manager, and automatically deploys, updates, or deletes child resources to match the desired state.

Changes:

  • Added mothership auto-provisioning system with intelligent reconciliation logic
  • Implemented State Manager client for persistent manifest state tracking
  • Added /manifest endpoint for service discovery
  • Fixed endpoint bug in manifest_client.py (corrected /directory to /manifest)

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 2 comments.

Changes by file:

  • tests/unit/runtime/test_mothership_provisioner.py — Comprehensive unit tests covering provisioner functions including URL construction, manifest loading, hash computation, reconciliation logic, and directory extraction
  • tests/integration/test_mothership_provisioning.py — Integration tests for end-to-end provisioning workflows including first boot, changes detection, resource removal, error handling, and idempotency
  • tests/integration/test_lb_remote_execution.py — Updated test assertions to include new lifespan parameter in handler creation
  • src/tetra_rp/runtime/state_manager_client.py — New HTTP client for State Manager API with methods for fetching/updating/removing persisted manifest state
  • src/tetra_rp/runtime/mothership_provisioner.py — Core provisioning logic implementing manifest reconciliation, resource deployment/update/deletion, and directory mapping
  • src/tetra_rp/runtime/manifest_client.py — Fixed endpoint from /directory to /manifest with updated documentation
  • src/tetra_rp/runtime/lb_handler.py — Added lifespan parameter to handler creation for startup/shutdown hooks
  • src/tetra_rp/core/resources/load_balancer_sls_resource.py — Sets FLASH_IS_MOTHERSHIP=true env var during deployment
  • src/tetra_rp/cli/commands/build_utils/lb_handler_generator.py — Added lifespan context manager with mothership provisioning logic and /manifest endpoint
  • docs/Cross_Endpoint_Routing.md — Updated documentation with mothership auto-provisioning architecture and terminology corrections


Comment thread src/tetra_rp/runtime/mothership_provisioner.py Outdated
Comment thread src/tetra_rp/cli/commands/build_utils/lb_handler_generator.py Outdated
deanq and others added 9 commits January 12, 2026 13:30
Changes FLASH_MOTHERSHIP_URL to FLASH_MOTHERSHIP_ID for cleaner
environment configuration. Child endpoints now use FLASH_RESOURCE_NAME
to identify which resource config they represent in the manifest.

Changes:
- ManifestClient: Construct URL from FLASH_MOTHERSHIP_ID instead of full URL
- ServiceRegistry: Use FLASH_RESOURCE_NAME with fallback to RUNPOD_ENDPOINT_ID
- Add tomli dependency for Python <3.11 pyproject.toml parsing (needed for build.py)

Benefits:
- Simpler environment configuration (ID instead of full URL)
- Clear distinction between mothership (RUNPOD_ENDPOINT_ID) and children (FLASH_RESOURCE_NAME)
- Consistent URL construction pattern

Files modified:
- src/tetra_rp/runtime/manifest_client.py
- src/tetra_rp/runtime/service_registry.py
- pyproject.toml
- uv.lock
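
A minimal sketch of the ID-based URL construction described above (the domain pattern is an assumption here; the authoritative version is in `manifest_client.py`):

```python
import os

# Assumed domain pattern for load-balanced serverless endpoints.
ENDPOINT_DOMAIN = "api.runpod.ai"

def mothership_url(env=None) -> str:
    """Build the mothership base URL from FLASH_MOTHERSHIP_ID.

    Children carry only the ID in their environment; the full URL is
    derived on demand, keeping configuration to a single short value.
    """
    env = env if env is not None else os.environ
    mothership_id = env.get("FLASH_MOTHERSHIP_ID")
    if not mothership_id:
        raise RuntimeError("FLASH_MOTHERSHIP_ID is not set")
    return f"https://{mothership_id}.{ENDPOINT_DOMAIN}"
```

The mothership itself derives its own URL the same way from RUNPOD_ENDPOINT_ID, which keeps the construction pattern consistent across both roles.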
Removes LoadBalancer resource filtering to enable multi-tier
architectures. Adds cache validation to prevent stale resources
from being deployed after codebase refactoring.

Provisioning Changes:
- Remove LoadBalancer filtering in reconcile_manifests()
- Support CpuLiveLoadBalancer, LiveLoadBalancer, LoadBalancerSlsResource
- Add filter_resources_by_manifest() to validate cached resources against manifest
- Add test-mothership mode with "tmp-" prefix for temporary test endpoints
- Change env vars: FLASH_MOTHERSHIP_URL -> FLASH_MOTHERSHIP_ID

Resource Manager Changes:
- Track all created resources (deployed = has ID) regardless of health status
- Cache resources even if deployment completes with errors
- Ensures cleanup capability for all created resources

Cache Validation:
- Prevents stale resources from old codebase versions being redeployed
- Validates: resource name exists in manifest + type matches
- Logs removed stale entries for visibility

Benefits:
- Multi-tier load balancing architectures now supported
- No orphaned resources from refactored code
- Better resource lifecycle management
- Reliable cleanup of all created resources

Files modified:
- src/tetra_rp/runtime/mothership_provisioner.py
- src/tetra_rp/core/resources/resource_manager.py
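
The cache-validation rule above (name must exist in the manifest and the type must match) can be sketched like this; names are illustrative, not the exact implementation:

```python
def filter_resources_by_manifest(cached: dict, manifest: dict) -> dict:
    """Drop cached resources that no longer match the manifest.

    `cached` and `manifest` both map resource name -> metadata dict with a
    "type" key. An entry survives only if its name still exists in the
    manifest AND its type matches; everything else is a stale leftover
    from an earlier codebase version.
    """
    kept = {}
    for name, entry in cached.items():
        expected = manifest.get(name)
        if expected is None or expected.get("type") != entry.get("type"):
            # Logged for visibility so removed stale entries are traceable.
            print(f"Removing stale cache entry: {name}")
            continue
        kept[name] = entry
    return kept
```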
…ements

Enables bundling local tetra_rp source into builds for development and
testing. Updates LB handler to serve authoritative manifest from State Manager.

Build System Changes:
- Add _find_local_tetra_rp() to detect development installations
- Add _bundle_local_tetra_rp() to copy source into build directory
- Add _extract_tetra_rp_dependencies() to parse pyproject.toml for deps
- Add _remove_tetra_from_requirements() to clean up after bundling
- Skip bundling for PyPI installations (site-packages)

LB Handler Changes:
- Store StateManagerClient in module-level state for /manifest endpoint
- Update /manifest endpoint to fetch from State Manager (single source of truth)
- Add proper error handling for uninitialized state client
- Restrict /manifest endpoint to mothership only (403 for children)
- Improve provisioning startup logging for clarity

Benefits:
- Test-mothership can use local tetra_rp changes without publishing
- Manifest endpoint serves complete authoritative state
- Child endpoints get consistent configuration from single source
- Better development workflow for framework changes

Files modified:
- src/tetra_rp/cli/commands/build.py
- src/tetra_rp/cli/commands/build_utils/lb_handler_generator.py
Adds --force flag to undeploy for non-interactive cleanup (needed by
test-mothership). Improves resource discovery visibility with debug logging.

Undeploy Changes:
- Add --force/-f flag to skip confirmation prompts
- Update _undeploy_by_name(), _undeploy_all(), _interactive_undeploy() to support skip_confirm
- Enables automated cleanup in CI/CD and test-mothership shutdown

Discovery Changes:
- Add detailed logging at each discovery phase (entry point, static imports, directory scan)
- Log discovered resource names and types for debugging
- Exclude .flash/ directory from project scanning (build artifacts)

Run Command Changes:
- Add resource discovery debug output showing found resources
- Display resource names and types before server startup

CLI Main Changes:
- Register test-mothership command (note: implementation was in commit 1)

Benefits:
- Test-mothership can cleanup automatically without user interaction
- Better visibility into resource discovery process
- Easier debugging of discovery issues
- Clean separation of interactive vs automated workflows

Files modified:
- src/tetra_rp/cli/commands/undeploy.py
- src/tetra_rp/cli/commands/run.py
- src/tetra_rp/core/discovery.py
- src/tetra_rp/cli/main.py
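
The `--force` flag above separates interactive and automated paths. A sketch with argparse (the actual CLI framework may differ):

```python
import argparse

def build_undeploy_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="flash undeploy")
    parser.add_argument("name", nargs="?", help="resource to undeploy (all if omitted)")
    parser.add_argument(
        "--force", "-f", action="store_true",
        help="skip confirmation prompts (for CI/CD and test-mothership shutdown)",
    )
    return parser

def undeploy(args, confirm=input) -> bool:
    """Tear down resources, prompting first unless --force was given."""
    if not args.force:
        answer = confirm(f"Undeploy {args.name or 'all resources'}? [y/N] ")
        if answer.strip().lower() != "y":
            return False
    # ... perform the actual teardown here ...
    return True
```

Passing `confirm` as a parameter keeps the prompt testable; `--force` simply bypasses it entirely.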
Updates all tests to reflect LoadBalancer provisioning, FLASH_RESOURCE_NAME
usage, and removal of obsolete test cases.

Mothership Provisioner Tests:
- Update tests to expect LoadBalancer resources in provisioning (not skipped)
- Fix create_resource_from_manifest tests to use RUNPOD_ENDPOINT_ID env var
- Update UnsupportedResourceType test (LoadBalancer now supported)
- Remove obsolete get_manifest_directory() tests (function removed)

Service Registry Tests:
- Update all tests to use FLASH_RESOURCE_NAME instead of RUNPOD_ENDPOINT_ID
- Add test for FLASH_RESOURCE_NAME priority with RUNPOD_ENDPOINT_ID fallback
- Update test names to reflect new behavior

Integration Tests:
- Update test_provision_children_skips_load_balancer_resources to
  test_provision_children_deploys_load_balancer_resources
- Fix assertions to expect 2 deployments (LoadBalancer + worker)
- Remove obsolete test_manifest_directory_endpoint_after_provisioning

Manifest Client Tests:
- Update initialization tests for FLASH_MOTHERSHIP_ID usage
- Update error message expectations

Test Rationale:
- LoadBalancer provisioning enables multi-tier architectures
- FLASH_RESOURCE_NAME provides clearer child endpoint identification
- Removed tests for deleted functionality (get_manifest_directory)

Files modified:
- tests/unit/runtime/test_mothership_provisioner.py
- tests/unit/runtime/test_service_registry.py
- tests/integration/test_mothership_provisioning.py
- tests/unit/runtime/test_manifest_client.py
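
The FLASH_RESOURCE_NAME priority with RUNPOD_ENDPOINT_ID fallback that these tests exercise amounts to a one-liner; a sketch of the lookup (function name is illustrative):

```python
import os

def resolve_resource_name(env=None):
    """Identify this endpoint within the manifest.

    Children are labeled with FLASH_RESOURCE_NAME; the mothership (or an
    unlabeled endpoint) falls back to RUNPOD_ENDPOINT_ID.
    """
    env = env if env is not None else os.environ
    return env.get("FLASH_RESOURCE_NAME") or env.get("RUNPOD_ENDPOINT_ID")
```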
@deanq deanq changed the title feat(mothership): implement auto-provisioning with manifest reconciliation feat(mothership): implement auto-provisioning with manifest Jan 13, 2026
deanq added 2 commits January 14, 2026 00:38
…irectories

Changes:
- Modified LBHandlerGenerator to use importlib pattern instead of from imports
- Aligns LB handlers with QB handler pattern for consistency
- Fixes SyntaxError when building projects with numeric directory names (e.g., 03_advanced_workers)
- Added boolean flags (is_load_balanced, is_live_resource) to replace string comparisons
- Added test coverage for numeric module paths

The bug occurred because Python identifiers cannot start with digits, but
importlib treats module paths as strings, allowing any valid filesystem path.
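
The digit-prefix behavior is easy to demonstrate: `import 03_advanced_workers` is a SyntaxError, but `importlib.import_module("03_advanced_workers")` is fine because module names are just strings to the import machinery. A minimal self-contained illustration:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a module whose name starts with a digit; such a name can never
# appear in an import statement, but importlib loads it without complaint.
tmp = Path(tempfile.mkdtemp())
(tmp / "03_demo.py").write_text("VALUE = 42\n")
sys.path.insert(0, str(tmp))

mod = importlib.import_module("03_demo")
print(mod.VALUE)  # -> 42
```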
Changes:
- Scanner now tracks config variable names (e.g., "gpu_config") at scan time
- Manifest includes config_variable field for each resource and function
- test-mothership uses config_variable from manifest for reliable discovery
- Added backward compatibility fallback to old search logic

Fixes "No config variable found" warnings when resource names differ from
variable names (e.g., resource "03_05_load_balancer_gpu" with variable "gpu_config").

This enables test-mothership to correctly discover and provision all resources
including load balancer endpoints, resolving health check failures.
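
The config_variable lookup with its backward-compatible fallback might look like the sketch below (illustrative names; `find_config_object` is not the actual function):

```python
def find_config_object(module_globals: dict, entry: dict):
    """Locate a resource's config object from a manifest entry.

    Prefer the config_variable recorded at scan time (e.g. "gpu_config"),
    since the variable name may differ from the resource name
    (e.g. resource "03_05_load_balancer_gpu"). Fall back to the old
    name-based search for manifests that predate the field.
    """
    var = entry.get("config_variable")
    if var and var in module_globals:
        return module_globals[var]
    # Backward-compatible fallback: match objects by their own name attribute.
    for obj in module_globals.values():
        if getattr(obj, "name", None) == entry.get("name"):
            return obj
    return None
```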
deanq and others added 2 commits January 14, 2026 02:02
Changes:
- Replace MD5 with SHA-256 for config hash computation (security best practice)
- Add error callback to background provisioning task for proper exception handling
- Update tests to expect SHA-256 hash length (64 chars instead of 32)

Addresses Copilot review comments:
- mothership_provisioner.py:113 - Use SHA-256 instead of cryptographically broken MD5
- lb_handler_generator.py:81 - Track background task and add error callback
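
Both fixes can be sketched together: SHA-256 for the config hash, and a done-callback so a fire-and-forget provisioning task cannot swallow its exception. Names below are illustrative, not the exact implementation:

```python
import asyncio
import hashlib
import json

def compute_config_hash(config: dict) -> str:
    # SHA-256 yields 64 hex chars (MD5 gave 32); sorted keys keep it stable.
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()

def on_provisioning_done(task: asyncio.Task) -> None:
    # Without a done-callback, exceptions in background tasks vanish silently.
    if not task.cancelled() and task.exception() is not None:
        print(f"provisioning failed: {task.exception()!r}")

async def lifespan_startup(provision_coro) -> asyncio.Task:
    # Start provisioning without blocking cold start; keep a reference to
    # the task so it is not garbage-collected mid-flight.
    task = asyncio.create_task(provision_coro)
    task.add_done_callback(on_provisioning_done)
    return task
```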
```python
try:
    client = await self._get_client()
    response = await client.get(
        f"{self.base_url}/api/v1/flash/manifests/{mothership_id}",
```
Contributor


does this still need to get updated to use the graphql endpoints?

Member Author


Ah yes. What is the correct way?

…hip-deploys-manifest

# Conflicts:
#	tests/integration/test_lb_remote_execution.py
@deanq deanq merged commit 14effd4 into main Jan 14, 2026
7 checks passed
@deanq deanq deleted the deanq/ae-1660-mothership-deploys-manifest branch January 14, 2026 22:32